Text mining and analytics were conducted on a dataset of scraped Reliefweb.int (RW) articles on Ukraine. Articles were limited to the year 2022 and the English language. A total of 3,895 documents were scraped and tokenised, and all stop words (e.g. the, a, we, can) were removed.
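As a rough illustration of the tokenisation and stop-word removal described above, here is a minimal Python sketch. The stop-word list, the regular expression and the sample sentence are assumptions made for illustration; the text does not say which tokeniser or stop-word list was actually used.

```python
import re

# Illustrative stop-word list only; the list actually used is not given in the text.
STOP_WORDS = {"the", "a", "we", "can", "and", "of", "to", "in", "is", "for", "on"}

def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

# Made-up sentence, not taken from the corpus.
sample = "We can see the humanitarian situation in Ukraine"
tokens = remove_stop_words(tokenise(sample))
print(tokens)
```

A real pipeline would probably also need to handle numbers, hyphenated terms and acronyms, which this simple pattern discards or splits.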
A brief examination of the most common word pairs in the titles of RW articles on Ukraine yields this network graph. Only the more common word pairs have been included, and the thickness of the line between two words indicates the number of times that pair appears in the corpus.
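The way such a graph is assembled can be sketched as follows: count word pairs, keep only the more common ones, and scale each line's thickness by its count. This is a sketch under stated assumptions, not the method actually used: it treats a "word pair" as two adjacent words, and the threshold `min_count`, the scaling `max_width` and the sample titles are all hypothetical.

```python
from collections import Counter

def count_word_pairs(token_lists):
    """Count adjacent word pairs across a list of tokenised titles."""
    pairs = Counter()
    for tokens in token_lists:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs

def build_edges(pair_counts, min_count=2, max_width=5.0):
    """Keep pairs seen at least min_count times and scale line width by count."""
    kept = {pair: n for pair, n in pair_counts.items() if n >= min_count}
    if not kept:
        return []
    top = max(kept.values())
    ranked = sorted(kept.items(), key=lambda item: -item[1])
    # Each edge is (word_a, word_b, count, line_width).
    return [(a, b, n, max_width * n / top) for (a, b), n in ranked]

# Made-up tokenised titles, with stop words already removed.
titles = [
    ["ukraine", "situation", "report"],
    ["ukraine", "situation", "report"],
    ["ukraine", "humanitarian", "snapshot"],
    ["ukraine", "situation", "update"],
]
edges = build_edges(count_word_pairs(titles))
for word_a, word_b, count, width in edges:
    print(f"{word_a} -- {word_b}: count={count}, width={width:.2f}")
```

The resulting edge list could then be handed to any graph-drawing library, with the last field used as the line width.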
This is all very much expected. Titles give us only the merest of
glimpses into the response and provide no further knowledge beyond what
could be obtained from casually watching the news. Whilst
Ukraine is obviously central to the corpus, as are
war and situation report, we also see
smaller but quite meaningful clusters. These include
fiscal and fy, as well as
snapshot, funding and appeal.
However, we see that most of the corpus deals with situation updates
or press releases (see the centrality of media and
government). Information on achievements is relatively
limited: estimated, reached,
assistance, cash and relief do
not form a large proportion of the titles.
If we take a look at the most common word pairs in the text of the scraped articles, they do not provide much additional detail. As with the graph above, the thickness of the line indicates the number of co-occurrences.
If I knew nothing about the situation in Ukraine, I could glean from
this graph that there is a war and a humanitarian response to it. I see
multiple sectors being mentioned, as well as refugees. The word
million shows up, as does scale.
I suppose this would work as a primer in other emergencies, but we will need a way to sort through the boilerplate (we will get to this later).
Let us, finally, take a macro-view of the dataset and plot, in a bit more detail, the correlations between word pairs (bigrams) within the corpus, so that we may get a lay of the land.
This network graph is much more complex than the previous ones, and it is formed of bigrams rather than single words, as this tends to improve interpretability at the cost of sensitivity. The main patterns in the RW corpus are now visible.
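The text does not say how the correlations between bigrams were computed. One common choice, assumed here purely for illustration, is the phi coefficient: for two bigrams it measures how much more often they appear in the same document than chance would suggest. The function names and sample documents below are hypothetical.

```python
import math
from itertools import combinations

def bigrams(tokens):
    """Return the set of adjacent word pairs in one tokenised document."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}

def phi(n11, n10, n01, n00):
    """Phi coefficient from a 2x2 table of document counts."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def pairwise_phi(docs):
    """Correlate every pair of bigrams by their presence across documents."""
    doc_sets = [bigrams(doc) for doc in docs]
    vocab = sorted(set().union(*doc_sets))
    total = len(doc_sets)
    result = {}
    for x, y in combinations(vocab, 2):
        n11 = sum(1 for s in doc_sets if x in s and y in s)      # both present
        n10 = sum(1 for s in doc_sets if x in s and y not in s)  # only x
        n01 = sum(1 for s in doc_sets if y in s and x not in s)  # only y
        n00 = total - n11 - n10 - n01                            # neither
        result[(x, y)] = phi(n11, n10, n01, n00)
    return result

# Made-up tokenised documents, with stop words already removed.
docs = [
    ["cash", "assistance", "reached", "people"],
    ["cash", "assistance", "reached", "households"],
    ["situation", "report", "ukraine"],
    ["situation", "report", "war"],
]
cors = pairwise_phi(docs)
print(cors[("assistance reached", "cash assistance")])
print(cors[("cash assistance", "situation report")])
```

Comparing every pair of bigrams is quadratic in the vocabulary size, so on a corpus of several thousand documents rare bigrams would normally be filtered out first.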